I picked this course from the University of Helsinki course catalog.
I am really excited to learn R, because I want new tools for processing data. I usually use Excel, and as a software engineer I would rather put my existing skills to full use.
The first examples were really basic, but I found them interesting: the point of learning this kind of programming language is to get things done quickly.
Github: https://github.com/plipplopplipplop/IODS-project
Github Pages: https://plipplopplipplop.github.io/IODS-project
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
rd <- read.table("data/learning2014.txt",
                 as.is = TRUE,
                 sep = "\t",
                 header = TRUE)
str(rd)
## 'data.frame': 183 obs. of 71 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ d_sm : int 15 13 15 13 16 19 17 15 19 16 ...
## $ d_ri : int 15 10 16 15 14 20 17 13 17 16 ...
## $ d_ue : int 13 12 11 14 14 18 12 11 16 16 ...
## $ deep : int 43 35 42 42 44 57 46 39 52 48 ...
## $ st_os : int 14 14 13 12 16 15 12 15 16 16 ...
## $ st_tm : int 13 8 16 13 13 14 6 17 18 12 ...
## $ stra : int 27 22 29 25 29 29 18 32 34 28 ...
## $ su_lp : int 10 9 7 7 8 8 5 10 7 10 ...
## $ su_um : int 11 12 8 11 12 10 5 11 6 12 ...
## $ su_sb : int 10 17 12 9 14 11 13 13 13 14 ...
## $ surf : int 31 38 27 27 34 29 23 34 26 36 ...
The data set was collected as part of an international research project whose topic was teaching and learning. A total of 183 participants answered the questionnaire, which asked how attendees study during the course: do they take a deep approach, studying hard and focusing on details; a strategic approach, planning their studies and working evenly through the course; or a surface approach, getting the course done with minimal effort and moving on to the next topic early.
There are 32 questions (marked STxx, SUxx, and Dxx in the data set) that measure three internal dimensions (deep, surface, and strategic). Each set of questions is first grouped into sub-scales that capture the idea behind each approach (d_xx, su_xx, and st_xx), and these sub-scales are then combined into an overall score for each approach (the variables deep, surf, and stra).
The questionnaire also contained additional question sets, measured with Aa-Af, Ca-Ch, and Da-Dj. Basic background information is stored in the data set as well (age, gender, attitude, and total points).
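Based on the numbers in the str() output, the combined deep score looks like a plain sum of the twelve Dxx questions. A sketch of checking that guess (the column names are taken from the output above):

```r
deep_questions <- c("D03", "D06", "D07", "D11", "D14", "D15",
                    "D19", "D22", "D23", "D27", "D30", "D31")
# TRUE if deep really is the sum of the individual deep-approach items
all(rowSums(rd[, deep_questions]) == rd$deep)
```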
myplot <- ggpairs(rd,
                  columns = c(57, 58, 59, 64, 67, 71),
                  mapping = ggplot2::aes(col = gender, alpha = 0.3),
                  lower = list(combo = wrap("facethist", bins = 20)))
myplot
The charts are colour-coded by gender (red being female). From them the reader can see the distribution of questionnaire points depending on the attendee's gender, age, and attitude towards statistics.
The students were quite young, and attitude was slightly better among males. Overall points were quite good, and the deep approach to studying was the most common one. The strategic approach was used in combination with deep learning, but the surface approach was not that common.
diagplots <- lm(points ~ surf + stra + deep, rd)
summary(diagplots)
##
## Call:
## lm(formula = points ~ surf + stra + deep, data = rd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.222 -2.516 1.750 5.531 13.291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.12137 6.81477 3.980 0.00010 ***
## surf -0.17294 0.10303 -1.678 0.09500 .
## stra 0.28027 0.10246 2.735 0.00686 **
## deep -0.17222 0.09888 -1.742 0.08326 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.46 on 179 degrees of freedom
## Multiple R-squared: 0.06432, Adjusted R-squared: 0.04864
## F-statistic: 4.101 on 3 and 179 DF, p-value: 0.007617
I chose three explanatory variables: deep, surf, and stra.
I chose these variables because each of them is a combined score over a set of questions, and these three scores should have a relationship with the total exam points.
Residuals are the differences between the observed response values and the values predicted by the model; here the largest under-prediction of points is about 13 and the largest over-prediction about 22.
Estimate is the fitted coefficient of each explanatory variable: the expected change in points for a one-unit change in that variable. Std. Error is the uncertainty of that estimate. The t value tells how many standard errors the estimate is away from zero, and Pr(>|t|) is the probability of observing a t value at least this extreme if the true coefficient were zero.
Multiple R-squared is the proportion of the variance in points that the model explains, and adjusted R-squared corrects it for the number of explanatory variables. Explaining only about 6 % of the variance means the model actually fits the data rather poorly, even though the overall F-test (p ≈ 0.008) says the fit is better than chance.
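The R-squared values discussed above can be read directly from the fitted model object:

```r
fit_summary <- summary(diagplots)
fit_summary$r.squared      # multiple R-squared, about 0.064
fit_summary$adj.r.squared  # adjusted R-squared, about 0.049
```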
par(mfrow = c(2,2))
plot(diagplots, which = c(1, 2, 5))
These are the standard regression diagnostic plots. Residuals vs Fitted checks that the residuals show no systematic pattern, i.e. that the linearity and constant-variance assumptions hold; Normal Q-Q checks that the residuals are approximately normally distributed; and Residuals vs Leverage highlights observations with an unusually strong influence on the fitted model.
date()
## [1] "Mon Dec 13 02:37:35 2021"
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(boot)
url <- "https://github.com/rsund/IODS-project/raw/master/data"
url_alc <- paste(url, "alc.csv", sep = "/")
alc <- read.csv(url_alc, header = TRUE)
dim(alc)
## [1] 370 51
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
This joined data set covers the alcohol consumption of students in two Portuguese schools. The data were collected using school reports and questionnaires. The joined set contains 370 observations of 51 variables.
I chose the following four variables for my personal hypothesis on alcohol consumption: 1. romantic 2. higher 3. internet 4. schoolsup
My hypothesis is that a romantic relationship and aiming for higher education decrease alcohol consumption, while lacking support from school and not having an internet connection at home (needing to stay longer on campus) increase it.
phPlot <- ggplot(alc, aes(x = romantic, y = alc_use, col = sex))
phPlot + geom_boxplot() +
ylab("Average alcohol consumption") +
xlab("With a romantic relationship")
As the box plot shows, alcohol consumption did not drop much when a person was in a romantic relationship. Female consumption dropped slightly, while among the lower-consuming males it slightly increased.
phPlot <- ggplot(alc, aes(x = higher, y = alc_use, col = sex))
phPlot + geom_boxplot() +
ylab("Average alcohol consumption") +
xlab("Wants to take higher education")
As assumed, aiming for higher education lowered alcohol consumption among males. But consumption among females who want to take higher education is higher, which is the total opposite of my assumption.
phPlot <- ggplot(alc, aes(x = internet, y = alc_use, col = sex))
phPlot + geom_boxplot() +
ylab("Average alcohol consumption") +
xlab("Internet access at home")
Here is the second surprising result: internet access at home increases the alcohol consumption of females. Among males the median stayed the same, but the consumption became more dispersed at both the upper and lower ends.
phPlot1 <- ggplot(alc, aes(x = schoolsup, y = alc_use, col = sex))
phPlot1 + geom_boxplot() +
ylab("Average alcohol consumption") +
xlab("Extra educational support")
This was the only assumption that went as I expected: the more educational support students get, the lower their alcohol consumption. The drop was clear for both genders, as I assumed.
m <- glm(high_use ~ romantic + higher + internet + schoolsup, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ romantic + higher + internet + schoolsup,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2857 -0.8807 -0.7814 1.5065 1.7737
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2511 0.5980 0.420 0.675
## romanticyes -0.2831 0.2547 -1.112 0.266
## higheryes -1.2242 0.5263 -2.326 0.020 *
## internetyes 0.2261 0.3289 0.688 0.492
## schoolsupyes -0.3105 0.3574 -0.869 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 444.76 on 365 degrees of freedom
## AIC: 454.76
##
## Number of Fisher Scoring iterations: 4
In the summary we can see that higher is the only variable whose p-value is under 5 %, so only it can be considered significant. This matches the charts: the only clearly meaningful relationship in my personal hypothesis was indeed the variable higher. The deviance residuals lean towards the negative side, reflecting that high alcohol usage is the rarer outcome among these students, though there is variation between the minimum and maximum.
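The drop from the null deviance (452.04) to the residual deviance (444.76) can also be tested formally with an analysis of deviance. This is a standard follow-up, not part of the output above:

```r
# Sequential chi-squared tests for each term added to the model
anova(m, test = "Chisq")
```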
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.2854900 0.3969561 4.2627835
## romanticyes 0.7534579 0.4527659 1.2317868
## higheryes 0.2939864 0.1007957 0.8226494
## internetyes 1.2537265 0.6707657 2.4542330
## schoolsupyes 0.7330743 0.3506197 1.4390072
As the odds ratios show, higher is the only variable whose effect clearly differs from one: its odds ratio is about 0.29, and its confidence interval (roughly 0.10-0.82) lies entirely below 1. The confidence intervals of the other variables all contain 1, so their associations with high alcohol use cannot be distinguished from no effect.
prob <- predict(m, type = "response")
alc <- mutate(alc, probability = prob)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 103 8
table(high_use = alc$high_use, prediction = alc$prediction) %>%
prop.table() %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.27837838 0.02162162 0.30000000
## Sum 0.95945946 0.04054054 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2972973
As the cross tabulation shows, the predictions are quite accurate when the true value of high_use is FALSE (252 of 259 correct) but poor when it is TRUE (only 8 of 111 correct): with these predictors the model almost always predicts FALSE.
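The asymmetry can be quantified directly from the cross tabulation (a sketch using the counts shown above):

```r
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
conf["FALSE", "FALSE"] / sum(conf["FALSE", ])  # specificity: 252/259, about 0.97
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])     # sensitivity: 8/111, about 0.07
```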
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.3027027
Using the same method as in DataCamp: the 10-fold cross-validation gives a prediction error of about 0.303, roughly 0.5 percentage points higher than the training error of 0.297. A slightly higher error is expected here, since cross-validation estimates performance on data the model has not seen.
date()
## [1] "Mon Dec 13 02:37:38 2021"
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data("Boston")
dim(Boston)
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The Boston data frame is a sample dataset from the R package MASS. It has 506 rows and 14 columns describing housing values and related neighbourhood characteristics in the suburbs of Boston.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
par(mfrow=c(7,2), fin=c(4,4), mar=c(2,5,2,1), las=1, bty="n")
plot(Boston$crim, ylab = "Per capita", main = "Crime Rate")
plot(Boston$zn, ylab = "lots over 25,000 sq.ft", main = "Land zoned for lots")
plot(Boston$indus, ylab = "acres per town", main = "Non-retail business acres")
plot(Boston$medv, ylab = "Value of homes", main = "Owner-occupied homes")
plot(Boston$nox, ylab = "parts per 10 million", main = "NOX concentration")
plot(Boston$rm, ylab = "number of rooms", main = "Rooms per dwelling")
plot(Boston$age, ylab = "built prior to 1940", main = "Owner-occupied units built")
plot(Boston$dis, ylab = "weighted mean", main = "Distances to employment centres")
plot(Boston$rad, ylab = "index", main = "Accessibility to highways")
plot(Boston$tax, ylab = "tax rate per 10,000", main = "Property-tax rate")
plot(Boston$ptratio, ylab = "ratio", main = "Pupil-teacher")
plot(Boston$black, ylab = "1000(Bk - 0.63)^2", main = "Proportion of blacks")
plot(Boston$lstat, ylab = "Percent", main = "Lower status")
Comparing the other variables to the crime rate: higher crime seems to occur in areas with more renter-occupied housing and a higher property-tax rate, where the owners do not live in the same area. Higher NOX concentration (traffic pollution) and proximity to radial highways also seem to go together with a higher crime rate, and employment centres are further away. On the other hand, the crime rate does not appear to be related to the black proportion variable or to the number of rooms per dwelling.
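These impressions can be checked numerically from each variable's correlation with the crime rate (a quick sketch):

```r
# Correlation of every variable with the per-capita crime rate
round(cor(Boston)[, "crim"], 2)
```

Variables such as rad and tax should show clearly positive correlations with crim, supporting the highway and property-tax observations.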
# Make standardized dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
# Create categorical variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
# Drop the crime rate from old data and add new categorial
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# Divide the dataset to train and test sets.
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
dim(test)
## [1] 102 14
dim(train)
## [1] 404 14
Comparing the two summaries shows what scaling actually does: each variable is centred to mean 0 and divided by its standard deviation. That is why variables that can never be negative on the original scale (such as crim) now contain negative values; every observation below the variable's mean becomes negative after centring.
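What scale() does can be verified by hand on one column (a sketch using the unscaled Boston data frame loaded above):

```r
# scale() subtracts the column mean and divides by the standard deviation
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_std, as.numeric(scale(Boston$crim)))  # should be TRUE
```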
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)  # crime is the categorical class; crim was dropped earlier
plot(lda.fit, dimen = 2, col = classes, pch = classes, xlab = "Crime Rate", ylab = "Other factors")
lda.arrows(lda.fit, myscale = 2)
From this plot we can see that accessibility to radial highways (rad) affects the crime-rate classification the most. Nitrogen oxide concentration (nox) also shows a noticeable link to increasing crime rates, whereas a larger proportion of residential land zoned for large lots (zn) goes together with lower crime rates.
crimeCat <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = crimeCat, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 9 14 4 0
## med_low 8 15 7 0
## med_high 0 7 11 1
## high 0 0 1 25
The high crime-rate predictions are quite consistent, while there is more variance in the low and med_low predictions.
data("Boston")
Boston <- scale(Boston)
km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)
I ended up with four clusters. From the pairs plots the clusters are not that dispersed, and every chart shows a few large centres that can be combined. With five clusters there was always one cluster that was too small or too dispersed; with three the clusters were even, but four gives equally even clusters while capturing more of the structure.
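One common way to sanity-check the choice of four clusters is an "elbow" plot of the total within-cluster sum of squares for different cluster counts (a sketch; the seed value is arbitrary):

```r
set.seed(123)  # k-means starting centers are random
twcss <- sapply(1:10, function(k) kmeans(Boston, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster SS")
```

The point where the curve stops dropping sharply suggests a reasonable number of clusters.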
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$centers)
Both plots show the same point cloud; in the first the points are coloured by the crime classes of the training data.
date()
## [1] "Mon Dec 13 02:37:49 2021"
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
library(tidyr)
library(ggplot2)
data("tea")
human <- read.csv(file = ".\\data\\human2.csv", header = TRUE)
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ X : chr "Norway" "Australia" "Switzerland" "Denmark" ...
## $ edu2R: num 1.007 0.997 0.983 0.989 0.969 ...
## $ labR : num 0.891 0.819 0.825 0.884 0.829 ...
## $ life : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ mmr : int 4 6 6 5 6 7 9 28 11 8 ...
## $ abr : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parl : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The HDI can also be used to question national policy choices, asking how two countries with the same level of GNI per capita can end up with different human development outcomes.
This data set is a subset of the data used to calculate the HDI country by country. Judging from the variable names and values, it contains the country name (X), female/male ratios of secondary education (edu2R) and labour-force participation (labR), life expectancy (life), expected years of education (edu), GNI per capita (gni), maternal mortality ratio (mmr), adolescent birth rate (abr), and the share of women in parliament (parl).
pca_human <- prcomp(human[-1])
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 2)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "GNI based PCA")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
pca_human <- prcomp(scale(human[-1]))
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 2)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Scaled PCA")
Comparing the standardized and non-standardized biplots clearly shows that without scaling the result is off. The gni values are huge (close to 100,000) because during the data wrangling the thousands separator was removed and the textual values were converted to numeric. Without scaling, gni therefore dwarfs all the other variables: in the non-standardized biplot gni is the only visible arrow, and the others are so small that they are not even drawn.
In the scaled PCA there is a correlation between female labour-force participation and the share of women in parliament. Together they drive the second principal component, which accounts for about 17 % of the variance; both relate to gni, which is basically income, and with more income come better services and better chances to survive.
The remaining variables drive the first principal component, which accounts for about 53.6 % of the variance. Maternal mortality ratio and adolescent birth rate load on one side, while education, life expectancy, and GNI load on the other: in countries where people die young and are less educated, maternal mortality and adolescent births are high.
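The dominance of gni in the unstandardized PCA can be verified by comparing the standard deviations of the raw variables; gni should be several orders of magnitude larger than the rest:

```r
# Standard deviations of the unscaled variables (country column dropped)
round(apply(human[-1], 2, sd), 2)
```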
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
tidyr::gather(tea_time) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
mca <- MCA(tea_time, graph = FALSE)
plot(mca, invisible=c("ind"), habillage = "quali")
I chose a subset of the tea data focused on how people consume tea: tea bags or loose tea, with or without sugar, where it is bought, which type, whether it is drunk at lunch, and whether additional flavourings such as lemon or milk are used.
As the MCA factor map and the bar charts show, the data can now be analysed. In the MCA factor map, the closer a category is to the origin, the more common it is, and categories that appear close together tend to occur together. Here Earl Grey is not that far from black tea, tea is quite often consumed at lunchtime, and milk plays a larger role than I originally expected.
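How much of the variation the first MCA dimensions actually capture can be read from the eigenvalues (using the `mca` object fitted above):

```r
# Eigenvalues and percentages of variance for the MCA dimensions
head(mca$eig)
```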
date()
## [1] "Mon Dec 13 02:37:52 2021"
library(ggplot2)
library(tidyr)
library(dplyr)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
RATS
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
## 7 7 1 275 275 260 270 273 274 276 271 282 281 284
## 8 8 1 245 255 260 268 270 265 265 267 273 274 278
## 9 9 2 410 415 425 428 438 443 442 446 456 468 478
## 10 10 2 405 420 430 440 448 460 458 464 475 484 496
## 11 11 2 445 445 450 452 455 455 451 450 462 466 472
## 12 12 2 555 560 565 580 590 597 595 595 612 618 628
## 13 13 3 470 465 475 485 487 493 493 504 507 518 525
## 14 14 3 535 525 530 533 535 540 525 530 543 544 559
## 15 15 3 520 525 530 540 543 546 538 544 553 555 548
## 16 16 3 510 510 520 515 530 538 535 542 550 553 569
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
# Convert data to long form
RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD, 3, 4)))
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
The RATS data contains rat weight measurements over a 1-64 day time span. There are 16 individual rats, divided into three groups. The plot above shows the weight of each individual rat in each group over the 64 days. The profiles are consistent and the rats within each group stay close to each other, except in Group 2, where a single rat is substantially heavier than the others in the same group.
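The Group 2 outlier is also easy to spot numerically. A base-R sketch with toy numbers (not the actual RATS weights) of how a within-group outlier inflates simple summary statistics:

```r
# Toy weights for three groups; the third "Group 2" rat is far heavier
weight <- c(250, 255, 245, 410, 420, 555, 520, 530)
group  <- factor(c(1, 1, 1, 2, 2, 2, 3, 3))
tapply(weight, group, mean)
tapply(weight, group, sd)   # Group 2's sd is inflated by the heavy rat
```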
# Standardize the data
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardized weight")
Standardization removes the overall growth trend: each profile now shows a rat's weight relative to the other rats at the same time point, so a roughly flat line means the rat keeps its relative position throughout the study.
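A small sketch of what the per-timepoint standardization does, using base R's `scale()` (toy weights, not the real data): after standardizing, each timepoint has mean 0 and standard deviation 1.

```r
w <- c(240, 260, 410, 520)        # toy weights at one timepoint
z <- as.numeric(scale(w))         # (w - mean(w)) / sd(w)
z
```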
# Summary data with mean and standard error of Weight by Group and Time.
# Note: n counts time points here; strictly, the se of a group mean
# should use the number of rats in that group.
n <- RATSL$Time %>% unique() %>% length()
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
geom_point(size=3) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Overall, the chart shows that Group 1 (the smallest rats) keeps its weight gain consistently at around 10%, and the same goes for Group 3. Group 2 gains closer to 15%, but that extra gain can largely be attributed to the single much heavier rat in that group.
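The error bars in the chart are standard errors of the group means. As a reminder of the formula, the standard error of a mean is sd/sqrt(n), where n is the number of values averaged; a base-R sketch with toy Group 1 weights at one timepoint:

```r
x  <- c(240, 225, 245, 260, 255, 260, 275, 245)  # toy weights, 8 rats
se <- sd(x) / sqrt(length(x))                    # standard error of mean
se
```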
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Drop the outlying per-rat means outside the (250, 550) range
RATSL8S <- RATSL8S %>% filter(mean < 550 & mean > 250)
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Rat Weight)")
Looking at the box plot, even though the weight range within each group is not that large, the heavier rats tend to gain weight rather than lose it, which can be seen in the medians of the box plots. The lightest rats get lighter, while in the other groups the trend is upwards.
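Instead of the hand-picked 250/550 cutoffs used above, base R's `boxplot.stats()` flags points beyond 1.5 * IQR from the hinges as outliers, which is a more principled way to find them. A sketch with toy per-rat means (not the real ones):

```r
means <- c(263, 248, 261, 267, 270, 276, 274, 267, 592)
boxplot.stats(means)$out   # only the extreme value is flagged
```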
BPRS
## treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1 1 1 42 36 36 43 41 40 38 47 51
## 2 1 2 58 68 61 55 43 34 28 28 28
## 3 1 3 54 55 41 38 43 28 29 25 24
## 4 1 4 55 77 49 54 56 50 47 42 46
## 5 1 5 72 75 72 65 50 39 32 38 32
## 6 1 6 48 43 41 38 36 29 33 27 25
## 7 1 7 71 61 47 30 27 40 30 31 31
## 8 1 8 30 36 38 38 31 26 26 25 24
## 9 1 9 41 43 39 35 28 22 20 23 21
## 10 1 10 57 51 51 55 53 43 43 39 32
## 11 1 11 30 34 34 41 36 36 38 36 36
## 12 1 12 55 52 49 54 48 43 37 36 31
## 13 1 13 36 32 36 31 25 25 21 19 22
## 14 1 14 38 35 36 34 25 27 25 26 26
## 15 1 15 66 68 65 49 36 32 27 30 37
## 16 1 16 41 35 45 42 31 31 29 26 30
## 17 1 17 45 38 46 38 40 33 27 31 27
## 18 1 18 39 35 27 25 29 28 21 25 20
## 19 1 19 24 28 31 28 29 21 22 23 22
## 20 1 20 38 34 27 25 25 27 21 19 21
## 21 2 1 52 73 42 41 39 38 43 62 50
## 22 2 2 30 23 32 24 20 20 19 18 20
## 23 2 3 65 31 33 28 22 25 24 31 32
## 24 2 4 37 31 27 31 31 26 24 26 23
## 25 2 5 59 67 58 61 49 38 37 36 35
## 26 2 6 30 33 37 33 28 26 27 23 21
## 27 2 7 69 52 41 33 34 37 37 38 35
## 28 2 8 62 54 49 39 55 51 55 59 66
## 29 2 9 38 40 38 27 31 24 22 21 21
## 30 2 10 65 44 31 34 39 34 41 42 39
## 31 2 11 78 95 75 76 66 64 64 60 75
## 32 2 12 38 41 36 27 29 27 21 22 23
## 33 2 13 63 65 60 53 52 32 37 52 28
## 34 2 14 40 37 31 38 35 30 33 30 27
## 35 2 15 40 36 55 55 42 30 26 30 37
## 36 2 16 54 45 35 27 25 22 22 22 22
## 37 2 17 33 41 30 32 46 43 43 43 43
## 38 2 18 28 30 29 33 30 26 36 33 30
## 39 2 19 52 43 26 27 24 32 21 21 21
## 40 2 20 47 36 32 29 25 23 23 23 23
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks, 5, 5)))
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) + geom_point(aes(col = treatment))
The BPRS data set contains 40 individuals split across two treatments (treatment 1: 20 individuals, treatment 2: 20 individuals). Each individual receives the treatment for 8 weeks, with a baseline measured at week 0, and each is rated weekly on the BPRS:
The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
As the previous plot shows, symptom scores are clearly elevated at the beginning of the measurement period for both treatments, but under treatment 1 the scores drop and symptoms get milder for most subjects, while under treatment 2 the scores do not stabilize as well for several individuals.
ggplot(BPRSL, aes(x = week, y = bprs, group = week)) +
geom_boxplot() +
scale_x_continuous(name = "Week", breaks = 0:8) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top")
Examining the same data with one box plot per week shows the overall level of symptoms more clearly. The scores remain clearly elevated until week 5, after which they start to dissipate. Some symptoms remain, and combining this with the previous plot, we can see that treatment 2 does not work as well as hoped.
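The tabular counterpart of these plots is the mean score per treatment and week, which base R's `aggregate()` computes directly. A sketch with toy scores (not the real BPRS values):

```r
d <- data.frame(
  treatment = factor(rep(1:2, each = 4)),
  week      = rep(c(0, 8), times = 4),
  bprs      = c(50, 30, 48, 32, 52, 40, 49, 41)
)
aggregate(bprs ~ treatment + week, data = d, FUN = mean)
```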
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
pairs(BPRSL[-3], pch = 19, col = BPRSL$treatment)
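A natural next step after the pairs plot would be to model bprs as a function of week and treatment with a linear model; a base-R sketch using simulated data (not the real BPRS scores), where the negative week coefficient reflects scores falling over time:

```r
set.seed(1)
d <- data.frame(
  week      = rep(0:8, times = 4),
  treatment = factor(rep(1:2, each = 18)),
  bprs      = 50 - 2 * rep(0:8, times = 4) + rnorm(36, sd = 3)
)
fit <- lm(bprs ~ week + treatment, data = d)
coef(fit)[["week"]]   # negative slope: scores decrease over the weeks
```

Note that a plain `lm()` ignores the repeated measurements per subject; the chapter's proper analysis would use a mixed model with a random effect for subject.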